This report is for ETC5521 Assignment 1 by Team WOMBAT, comprising Hai Hanh Ngo, Dewi Lestari Amaliah, Siyi Li, and Priya Ravindra Dingorkar.

1 Acknowledgements

We sincerely thank Professor Dianne Cook, Dr Emi Tanaka, and the tutors Sayani Gupta and Sherry Zhang for their guidance and encouragement. Their combined efforts have put us in a position to produce this report collaboratively.

2 Introduction and Motivation

Getting data from the hotel industry has always been a real challenge. Although we can easily read about hotel jargon, or search market research reports on the internet for standard descriptive statistics of the hotel industry and the price of our next summer holiday, less is known about how a hotel actually works behind the scenes. In this report, let us take a sneak peek into hotel data in detail and come up with some interesting findings.

The “hotel booking demand datasets” compiled by Nuno Antonio, Ana de Almeida and Luis Nunes (Antonio, Almeida, and Nunes 2019) were a valuable effort to overcome this challenge. The data were obtained from two hotels in Portugal: one city hotel in Lisbon and one resort hotel in the Algarve. Some sensitive information that could reveal the identity of the two hotels was not provided, but this does not diminish the value of the dataset for education, management, machine learning and many other purposes. Let us analyze it and find out some interesting things about how the two hotels work.

3 Research Questions

3.1 Primary Research Questions

Let us compare the efficiency of the two hotels, the City Hotel and the Resort Hotel, and try to understand their business and what it really looks like behind the scenes. We discuss how the hotels manage the details of their clients, how bookings are priced, and how canceled bookings are managed.

3.2 Secondary Research Questions

  • Which months across the two years saw the highest inflow of tourists?

  • Let’s figure out the average daily rate of both types of hotels and which type of hotel makes more money

  • Exploring the hotels’ market segments and customers’ booking preferences

  • Which segment of the hotel market is more profitable and lets customers book their trip easily?

  • Where Did The Bookings Come From?

  • How do international guests like to stay in the hotels of Portugal? How long do they stay there?

  • Let us further find out whether there is any relationship between the customer type and the ADR (Average Daily Rate) in the two hotel types

  • We do not know whether the meal type in the hotel data can be inferred from the guest’s length of stay. Let us try to find out!

  • Cars play a huge role in travel today. Let us find out whether there is any relationship between the car parking space requirements and the two types of hotels in the dataset.

4 Data description

4.1 Dataset overview and structure

The datasets for the two hotels can be downloaded separately from ScienceDirect.com, in the paper by Antonio, Almeida, and Nunes (2019). However, Mock (2020) of the TidyTuesday challenge did us a favor and combined the two datasets into one. The two original datasets and the combined one can be obtained at this GitHub page.

For this study, we used only the combined dataset, which is stored as a CSV file with 32 variables and 119,390 observations. Each observation represents one hotel booking.

Knowing that this dataset belongs to hotels, we can make sense of most of the variables. However, not all of them are familiar to someone without a background in hotel management. We run through some of the industry jargon and the variables’ meanings before taking a further look at the data.

Variables note:

  • is_canceled: (1) if the booking was canceled and (0) if not.
  • lead_time: number of days between when the booking was entered into the hotel’s booking system and the arrival date.
  • meal: Type of meal booked which can be:
    • BB: Bed and Breakfast.
    • FB: Full board (breakfast, lunch and dinner).
    • HB: Half board (breakfast and one meal, usually dinner).
    • Undefined/ SC: no meal package.
  • country: guests’ country of origin.
  • market_segment: guests’ market segment, some of which may be associated with a booking channel.
    • Direct: guests who book directly with the hotel, whether through the hotel’s website, by phone, or as walk-ins.
    • Corporate: guests whose bookings are made by a corporation/company, or guests who are business travellers.
    • Online TA: online travel agents - bookings made through third-party websites. Examples are Agoda, Expedia, Booking.com…
    • Offline TA/TO: Bookings made by Travel agents or Tour operators.
    • Complementary: free stays offered to guests, usually as part of the hotels’ promotional programs.
    • Groups: guests who travelled in groups.
    • Undefined: Undefined type of guests.
    • Aviation: We are not entirely sure but this could be airline crews.
  • distribution_channel:
    • Direct: bookings made directly with the hotels (hotel’s website, phone or walk-ins).
    • Corporate: bookings made by corporate/company.
    • TA/TO: travel agents/tour operators.
    • Undefined: Undefined distribution channel.
    • GDS: Global Distribution System. A GDS serves as a hub for companies in the travel industry (airlines, hotels, car rental…) to connect with travel agents. Hotels put some of their inventory (rooms) on the GDS, and travel agents can then sell those rooms to their customers. Well-known GDSs include Amadeus, Sabre and Galileo.
  • is_repeated_guest: (1) if the guest is a repeated guest and (0) if not.
  • customer_type:
    • Transient: individuals or groups that occupy fewer than 10 rooms per night. These guests usually stay in the hotel short-term and require few services.
    • Contract: bookings bound by contracts, usually for more than 30 days for a consistent block of rooms.
    • Transient-Party: a transient booking that is associated with other transient bookings.
    • Group: bookings associated to a group, usually occupy more than 10 rooms per night.
  • previous_cancellations: number of previous cancellations prior to current booking by a customer.
  • previous_bookings_not_canceled: number of previous non-canceled bookings by the customer prior to the current booking.
  • booking_changes: number of changes made to the booking from when it was entered into the system until the day of arrival/cancellation.
  • agent: ID of travel agency that made the bookings.
  • company: ID of companies that made the bookings.
  • days_in_waiting_list: number of days booking was in the waiting list before it was confirmed to customers.
  • adr: Average Daily Rate, computed as total room revenue (excluding breakfast, tax and service charges) divided by the total number of room nights sold.
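
The adr definition above can be made concrete with a small worked example. The report’s own analysis was done in R; the Python sketch below is only illustrative, and the function name and the revenue figures are hypothetical.

```python
def average_daily_rate(total_room_revenue, room_nights_sold):
    """ADR = total room revenue / total number of room nights sold."""
    if room_nights_sold <= 0:
        raise ValueError("room_nights_sold must be positive")
    return total_room_revenue / room_nights_sold


# Hypothetical figures: 270.0 in room revenue (excluding breakfast, tax and
# service charges) earned over 3 room nights.
example_adr = average_daily_rate(270.0, 3)
```

With these made-up figures, each room night earned 90.0 on average.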

Note that some variables contain “NULL” values (e.g. the agent or company variable). A “NULL” value does not mean that the value is missing; rather, the value did not exist to begin with. For example, a booking may have no agent or company ID associated with it because it was made by an individual customer.
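
One way to respect this distinction when reading the file is to keep “NULL” as a literal marker for “not applicable” instead of converting it to a missing value. The following Python sketch is illustrative only (the report itself used R); the three sample rows are invented, and only the column names come from the dataset.

```python
import csv
import io

# Hypothetical three-row extract; the values are made up for illustration.
SAMPLE_CSV = """hotel,agent,company
City Hotel,9,NULL
Resort Hotel,NULL,NULL
Resort Hotel,240,153
"""


def has_id(value):
    """True when the field holds a real ID rather than the "NULL" placeholder."""
    return value != "NULL"


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Count bookings by which intermediary, if any, was involved.
via_agent = sum(has_id(r["agent"]) for r in rows)
via_company = sum(has_id(r["company"]) for r in rows)
individual = sum(
    not has_id(r["agent"]) and not has_id(r["company"]) for r in rows
)
```

Here a booking with “NULL” in both columns is counted as an individual booking rather than as a record with missing data.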

4.2 Limitation

  • This dataset contains data for only two specific hotels in Portugal. This narrow scope made it challenging to explain the trends we observed, because the reasons could be hotel-specific and could not be covered by general industry knowledge.
  • Some information about the hotels was not provided, for example the number of rooms, the occupancy rate, the location of the hotels (in a busy district or in the suburbs), years of operation, or special events that might have occurred. This lack of information may leave some of our questions unanswered.
  • Even though we were provided with the collection method, we were not able to verify the validity and correctness of the data. We noticed during our analysis that some entries were not sensible and were very likely due to input errors. However, we were unable to verify this concern.
  • Since the observations began in July 2015 and ended in August 2017, only 2016 is covered for a full year; 2015 and 2017 are covered for only six and eight months, respectively. Hence, we do not analyze the data year by year, because such a comparison would not be like for like.

4.3 Collection methods

Antonio, Almeida, and Nunes (2019) collected the data by extracting the variables from the hotels’ PMS (Property Management System) databases’ server with a TSQL query in SQL Server Studio Manager. The tables that were used to extract the variables are:

  1. BO (booking table in which the key, which is the ID, was retrieved).
  2. BL (bookings change log, in this case, if the booking details with respect to the day before arrival changed, the value used was the one present in this table).
  3. ML (meals).
  4. DC (distribution channel).
  5. TR (transaction).
  6. CP (customer profiles).
  7. NT (nationalities).
  8. MS (market segments).

The diagram below, made by Antonio, Almeida, and Nunes (2019), presents the structure of the PMS databases:

Figure 4.1: PMS database diagrams

4.4 Data Cleaning

4.4.1 Missing values checking

Bennett (2001) argued that it is important to take missing values into account; otherwise the statistical analysis will be misleading and the variability of the data cannot be estimated correctly. Thus, before analyzing the data, we first checked for missing values using the visdat package (Tierney 2017).

Figure 4.2: Variables Type and Missing Value Visualization

We found missing values in the children variable and imputed them with the average number of children (mean imputation) (Kang 2013), creating a new variable called imputed_children. We added this newly created column to the original dataset.
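
Mean imputation can be sketched as follows. This is an illustrative Python version rather than the R code used for the report; the function name and the sample values are hypothetical, and None stands in for a missing entry.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical children counts for five bookings, one of them missing.
children = [0.0, 2.0, None, 1.0, 1.0]
imputed_children = impute_with_mean(children)
```

The missing entry is replaced by the mean of the four observed values, and the observed values are left unchanged.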

Figure 4.2 shows that there are no other missing values in the dataset. This is in line with the statement by Antonio, Almeida, and Nunes (2019) that there are no missing values in the database tables. However, we must note that some “NULL” values are present, which should be interpreted as “not applicable” rather than as missing values (Antonio, Almeida, and Nunes 2019). For example, if the company value is NULL, it means that the booking was not made by a company.

4.4.2 Data Transformation

The data were transformed to make them more structured. Properly structured and validated data boost data quality. Data types determine how variables can be visualised, so correct data types help produce more accurate results. With that in mind, we converted the data types using the mutate function from the tidyverse (Wickham et al. 2019) and the as.factor function from R’s built-in base package (R Core Team 2020). Figure 4.3 shows that the variables now have the correct types.

Figure 4.3: Data Type Visualization

We also created some variables by transforming or wrangling the original variable to analyze the data. Those variables are listed as follows:

  1. is_canceled_new. This variable is the same as is_canceled; we only recoded the values 0 and 1 as “not canceled” and “canceled” to make them easier to interpret.
  2. lengt_of_stay is the number of days that the guests spent in the hotel. It is the sum of the stays_in_weekend_nights and stays_in_week_nights variables.
  3. stay_on is a categorical variable indicating whether the guests stayed on weekends, weekdays, or both.
  4. number_of_guest is a summation of adults, imputed_children, and babies variables.
  5. kids is a summation of imputed_children and babies variables.
  6. family_type is a categorical variable to observe whether the guest stayed with kids or not.

The lengt_of_stay and stay_on variables are used to analyze the staying patterns of the guests, whilst number_of_guest, kids, and family_type are used to analyze the type of guests who stayed at the hotels. Moreover, the dataset only provides the country of each booking as an ISO country code. Hence, we transformed these codes into country names using the countrycode package (Arel-Bundock, Enevoldsen, and Yetman 2018).
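
The derived variables listed above can be sketched for a single booking as follows. This is an illustrative Python version, not the R code used for the report. The variable names follow the list above, but the category labels (“weekend”, “weekday”, “both”, “with kids”, “without kids”) and the handling of bookings with no nights recorded are assumptions.

```python
def derive_features(booking):
    """Compute the derived variables for one booking given as a dict."""
    weekend = booking["stays_in_weekend_nights"]
    week = booking["stays_in_week_nights"]

    # stay_on: which part of the week the stay covered (labels are assumed).
    if weekend > 0 and week > 0:
        stay_on = "both"
    elif weekend > 0:
        stay_on = "weekend"
    elif week > 0:
        stay_on = "weekday"
    else:
        stay_on = "no nights recorded"

    kids = booking["imputed_children"] + booking["babies"]
    return {
        "lengt_of_stay": weekend + week,
        "stay_on": stay_on,
        "number_of_guest": booking["adults"] + kids,
        "kids": kids,
        "family_type": "with kids" if kids > 0 else "without kids",
    }


# Hypothetical booking used only to exercise the function.
example_booking = {
    "stays_in_weekend_nights": 2,
    "stays_in_week_nights": 3,
    "adults": 2,
    "imputed_children": 1,
    "babies": 0,
}
derived = derive_features(example_booking)
```

For this made-up booking, the stay lasts five nights across both weekend and weekdays, with three guests including one child.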

5 Analysis and Findings

Tourism and Seasonality in Portugal

The tourism industry worries a lot about seasonality, as it affects the flow of visitors to tourist destinations. The hotel year is divided into two main seasons: high and low. As the names suggest, the high season is a busy period when the weather is good and guest inflows are high; the low season is the opposite.

Portugal is no exception. The high season in Portugal usually runs in summer (June to September) and in spring (January to March); the beaches are usually busiest in July and August. The low season generally falls in winter, beginning around November and ending at the end of February. Weather during this time can bring unexpected rain and a strong, cold breeze, which is not ideal for sightseeing (lisbonlisboaportugal.com, n.d.).

Keeping the season in mind, we are interested in finding out any potential effect of seasonality on the guest count and the ADR of our hotels in the report.

5.1 Which months across the two years saw the highest inflow of tourists?

The inflow of visitors to these two hotels helps us decide which months are the best time to travel to Portugal, and which hotel is the place to stay.

Figure 5.1: Count of Number of Guests by each month in the two Different Years

Months are shown in order of occurrence (July is the first month of each year of observation) to maintain the chronological order of the dataset. The overall pattern is nearly the same for both hotels and both years. The seasonal pattern resembles a “W” shape, with the low points of the W occurring in the winter months from November to January and the high points in the spring and summer periods. Interestingly, both hotels had their highest number of guests in the spring season in Year 1 (May and March), while in Year 2 the highest tourism was recorded in the summer season (August and October).

We can therefore infer that most months are a good time to visit Portugal, particularly from July to October and from January to March. The graph above also shows that more guests stayed in the City Hotel than in the Resort Hotel.
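
The monthly guest counts behind a figure like this can be sketched as a simple group-and-sum, with months ordered from July as described above. This Python version is illustrative only and is not the R code used for the report. It assumes that each booking carries the hotel, the arrival month name (assumed here to be the arrival_date_month column of the public dataset), and the derived number_of_guest; the sample bookings are invented, and whether canceled bookings were excluded is not stated in the report, so no filter is applied.

```python
from collections import Counter

# Month order used for the chart: each observation year starts in July.
MONTHS_FROM_JULY = [
    "July", "August", "September", "October", "November", "December",
    "January", "February", "March", "April", "May", "June",
]


def guests_per_month(bookings):
    """Sum number_of_guest for every (hotel, arrival month) pair."""
    counts = Counter()
    for b in bookings:
        counts[(b["hotel"], b["arrival_date_month"])] += b["number_of_guest"]
    return counts


def ordered_series(counts, hotel):
    """Return (month, guests) pairs for one hotel in July-first order."""
    return [(month, counts[(hotel, month)]) for month in MONTHS_FROM_JULY]


# Hypothetical bookings used only to exercise the functions.
sample_bookings = [
    {"hotel": "City Hotel", "arrival_date_month": "July", "number_of_guest": 2},
    {"hotel": "City Hotel", "arrival_date_month": "July", "number_of_guest": 3},
    {"hotel": "City Hotel", "arrival_date_month": "January", "number_of_guest": 1},
    {"hotel": "Resort Hotel", "arrival_date_month": "August", "number_of_guest": 4},
]
monthly_counts = guests_per_month(sample_bookings)
city_series = ordered_series(monthly_counts, "City Hotel")
```

Months with no bookings for a hotel appear in the series with a count of zero, which keeps the two hotels aligned on the same axis.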

5.2 Let’s figure out the average daily rate of both types of hotels and which type of hotel makes more money

The average daily rate (ADR) measures the average rental income earned per occupied room per day. It can be used to assess the operating performance of a hotel or other lodging company.

Figure 5.2: Average Daily Rate Versus Hotel Type

Figure 5.2 indicates a substantial gap between the two hotels. The resort had its highest prices in summer, which makes sense because the Algarve is a beach destination. Both hotels reported their lowest prices in winter, but the City Hotel showed less fluctuation than the Resort Hotel.
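
A comparison like this rests on averaging adr within each hotel and month. The Python sketch below is illustrative only and is not the R code used for the report; the function name, the arrival_date_month field, and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_adr_by_hotel_month(bookings):
    """Average the adr values within each (hotel, arrival month) group."""
    groups = defaultdict(list)
    for b in bookings:
        groups[(b["hotel"], b["arrival_date_month"])].append(b["adr"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical bookings used only to exercise the function.
adr_sample = [
    {"hotel": "Resort Hotel", "arrival_date_month": "August", "adr": 100.0},
    {"hotel": "Resort Hotel", "arrival_date_month": "August", "adr": 120.0},
    {"hotel": "City Hotel", "arrival_date_month": "August", "adr": 90.0},
]
mean_adr = mean_adr_by_hotel_month(adr_sample)
```

Plotting these group means by month, one line per hotel, gives the kind of seasonal comparison discussed above.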

5.3 Exploring the hotels’ market segments and customers’ booking preferences in the two hotel types

According to Meier (2017), one key to effective hotel management is knowing the specific market segments, particularly in order to set prices correctly. It is therefore important to identify the market segment distribution for these hotels.

A Phocuswright study cited by pegs.com (2016) stated that one of the major reasons people book through OTAs (online travel agencies) is that the websites are easy to use. Meanwhile, Howe (2017) argued that OTAs are favored because of their easy booking options. According to Jedina and Ranjinib (2017), as cited in Talwar et al. (2020), the factors behind booking a hotel through an OTA are accessibility, pricing, review accountability, and customer service. Let us find out whether this holds true for the Portuguese hotels.

Figure 5.3: The distribution of hotel market segment in the city and resort hotel (2015-2017)

Figure 5.3 shows that the vast majority of bookings came through Online Travel Agencies (OTA). For the City Hotel, customers generally made their bookings through online or offline agencies and as groups. For the Resort Hotel, on the other hand, corporate bookings and direct bookings made by the customers themselves played a larger role. Moreover, the share of bookings via OTA was greater in the city hotel than in the resort hotel. Unlike the city hotel, the resort hotel had a larger share of individual reservations than party bookings. This is consistent with the Phocuswright study: people like ease in booking their trips, and hence they use agency bookings.
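
The shares compared here are proportions of bookings within each hotel. The Python sketch below is illustrative only and is not the R code used for the report; the function name and the sample bookings are hypothetical, while hotel and market_segment are the variable names described earlier.

```python
from collections import Counter


def segment_shares(bookings):
    """Share of each market segment among the bookings of the same hotel."""
    per_hotel = Counter(b["hotel"] for b in bookings)
    per_pair = Counter((b["hotel"], b["market_segment"]) for b in bookings)
    return {
        (hotel, segment): n / per_hotel[hotel]
        for (hotel, segment), n in per_pair.items()
    }


# Hypothetical bookings used only to exercise the function.
segment_sample = [
    {"hotel": "City Hotel", "market_segment": "Online TA"},
    {"hotel": "City Hotel", "market_segment": "Online TA"},
    {"hotel": "City Hotel", "market_segment": "Online TA"},
    {"hotel": "City Hotel", "market_segment": "Groups"},
    {"hotel": "Resort Hotel", "market_segment": "Direct"},
    {"hotel": "Resort Hotel", "market_segment": "Online TA"},
]
shares = segment_shares(segment_sample)
```

Normalising within each hotel, rather than over all bookings, keeps the comparison fair when one hotel has many more bookings than the other.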

5.4 Which segment of the hotel market is more profitable and lets customers book their trip easily?

Studying which booking channel works best would help customers choose how to book their hotel stay.

Figure 5.4: The distribution of hotel market segment in the city and resort hotel by semester

OTA has been the major player in the market segments of these hotels, although in the city hotel it only overtook group reservations after a half-year period. In the first semester of 2016, the proportion of OTA bookings was double that of the previous semester.

In the resort hotel, by contrast, bookings through OTA dominated from the beginning of the observed period, although they decreased marginally in the first semester of 2016 as group bookings increased or customers started booking on their own. We can also see that both hotels hit their peak proportion of OTA bookings in semester 2 of 2016.

Aside from OTA, Figure 5.4 shows another interesting fact: even while OTA bookings dominated the market segments, the proportion of direct bookings in the resort hotel was relatively stable. This may be because the resort has its own way of promoting direct bookings, for example through a loyalty voucher.

5.5 Where Did The Bookings Come From?

Another way to obtain more business information is to look at the origins of the travellers who book hotel rooms. An understanding of their behavior and preferences is essential, so that hoteliers can establish strategies for attracting them.

We can examine which parts of the world are most drawn to Portugal.

Figure 5.5: The distribution of travelers’ origins for hotel bookings in 2015-2017

Figure 5.5 maps the bookings by country of origin to give a view of how bookings are distributed across the globe. The hover option shows the count of guests from each country who have visited Portugal. Hovering over the map tells us that Portugal sees more tourists from Europe than from other continents.

As this could be a topic of interest for many readers who want to know the count of guests that have visited Portugal, the interactive table below gives a detailed view. The table shows that the bookings came from 177 different countries.

Antonio, Nuno, Ana de Almeida, and Luis Nunes. 2019. “Hotel Booking Demand Datasets.” Data in Brief 22: 41–49.

Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman. 2018. “Countrycode: An R Package to Convert Country Names and Country Codes.” Journal of Open Source Software 3 (28): 848. https://doi.org/10.21105/joss.00848.

Bennett, Derrick A. 2001. “How Can I Deal with Missing Data in My Study?” Australian and New Zealand Journal of Public Health 25 (5): 464–69.

Howe, Neil. 2017. “Hotels Versus OTAs: Who Is Winning over Millennial Travelers?” https://www.forbes.com/sites/neilhowe/2017/07/31/hotels-versus-otas-who-is-winning-over-millennial-travelers/#27440fd5277a.

Jedina, Mohd Haniff, and Kohila Ranjinib. 2017. “Exploring the Key Factors of Hotel Online Booking Through Online Travel Agency.” In 4th International Conference on E-Commerce (ICoEC) 2017 Held in Malaysia.

Kang, Hyun. 2013. “The Prevention and Handling of the Missing Data.” Korean Journal of Anesthesiology 64 (5): 402.

lisbonlisboaportugal.com. n.d. “When to Visit Lisbon? The Best Time of Year for a Holiday to Lisbon and Weather.” https://lisbonlisboaportugal.com/lisbon-tour/lisbon-weather-when-to-go-visit.html.

Meier, Veit. 2017. “Market Segmentation - Know Where Your Hotel Demand Comes from.” https://www.bernerbecker.com/latest-articles/market-segmentation-know-hotel-demand-comes/.

pegs.com. 2016. “Why Do Travellers Prefer Booking with OTAs?” https://www.pegs.com/blog/why-do-travelers-prefer-booking-with-otas/.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Talwar, Shalini, Amandeep Dhir, Puneet Kaur, and Matti Mäntymäki. 2020. “Why Do People Purchase from Online Travel Agencies (OTAs)? A Consumption Values Perspective.” International Journal of Hospitality Management 88.

Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” Journal of Open Source Software 2 (16): 355. https://doi.org/10.21105/joss.00355.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.